Fuzz encoder #52
Draft
sirreal wants to merge 15 commits into
Fuzzes wp_is_valid_utf8(), wp_scrub_utf8(), and their pure-PHP fallbacks against five independent known-good oracles: mbstring, PCRE2, ICU (intl), CPython, and the WHATWG TextDecoder (Node), the last two as persistent subprocesses. All oracles must pass a hand-computed known-answer battery before use; iconv is excluded because libiconv accepts code points above U+10FFFF. Beyond the differentials, internal invariants are checked: validity iff scrub identity, scrub output validity, scrub idempotence, code point counts against the scrubbed length, and chunked _wp_scan_utf8() reconstruction with deterministic resumable-scan budgets. Inputs mix nine deterministic strategies (including random bytes, boundary-heavy valid UTF-8, mutations, invalid-atom splices, latin1, UTF-16, ASCII fast-path stress, and repeated motifs); every case is reproducible from (seed, case index) alone. Includes a multi-lane runner with stall detection, replay and signature-preserving minimization tools, and a harness self-test that mutation-tests detection against seven classes of deliberately broken implementations.
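The invariants and the (seed, case index) reproducibility rule above can be sketched in a few lines. This is an illustrative Python sketch, not the PR's PHP harness: `case_rng`, `generate_case`, `is_valid_utf8`, `scrub_utf8`, and `check_invariants` are hypothetical names, the seed-derivation scheme is an assumption, and CPython's own UTF-8 codec stands in for the targets under test.

```python
import hashlib
import random


def case_rng(seed, case_index):
    """Derive an RNG from (seed, case index) alone, with no shared state."""
    digest = hashlib.sha256(f"{seed}:{case_index}".encode("ascii")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate_case(seed, case_index):
    """Hypothetical 'random bytes' strategy: 0 to 32 arbitrary bytes."""
    rng = case_rng(seed, case_index)
    return bytes(rng.randrange(256) for _ in range(rng.randrange(33)))


def is_valid_utf8(data):
    """Stand-in validity target backed by CPython's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def scrub_utf8(data):
    """Stand-in scrub target: ill-formed spans become U+FFFD."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


def check_invariants(data):
    """Return the names of violated invariants (empty list: all hold)."""
    failures = []
    scrubbed = scrub_utf8(data)
    if is_valid_utf8(data) != (scrubbed == data):
        failures.append("validity-iff-scrub-identity")
    if not is_valid_utf8(scrubbed):
        failures.append("scrub-output-valid")
    if scrub_utf8(scrubbed) != scrubbed:
        failures.append("scrub-idempotent")
    return failures
```

Because each case's RNG is derived only from the pair, any failing case can be regenerated in isolation without replaying the cases before it.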
Adds four self-contained work-lane documents: extending the encoding fuzzer (utf8_encode/decode fallback differentials before PHP 9 removes the native oracles, the confirmed wp_has_noncharacters PCRE-vs-fallback divergence on ill-formed input, exhaustive code_point_to_utf8_bytes), an independent WP_HTML_Decoder fuzzer against the Dom\HTMLDocument oracle, WP_Token_Map property tests against a naive reference (building on the existing wpTokenMap.php tests), and a one-shot divergence survey of seems_utf8 and wp_check_invalid_utf8.
Extend the encoding fuzzer with targets for _wp_utf8_encode_fallback() and _wp_utf8_decode_fallback(), fuzzing them against mb_convert_encoding (primary) and the deprecated native utf8_encode()/utf8_decode() pair while it still exists, plus round-trip and output-validity invariants.

The handoff's premise that native and fallback share semantics on invalid input was falsified during implementation: legacy utf8_decode() groups a well-formed lead byte with its expected continuation length into a single '?' (surrogates, beyond-U+10FFFF, 3/4-byte overlongs, C2 C0), while WordPress deliberately follows mb_convert_encoding's maximal-subpart semantics (the PHP 9 polyfill in compat.php prefers mb; ticket #63863). The native decode oracle is therefore trusted on valid input only, where it provably agrees with mb on every code point, and the divergence is pinned by hand-computed battery vectors instead of fuzzed.

Detection is mutation-tested: seven new broken-implementation classes in the smoke test (cp1252-confused encoder, identity encoder, per-byte decoder, valid-input mangler, round-trip violator, null-returning encoder and decoder; the fallbacks are untyped, so non-string returns are reported as target-bad-return rather than silently skipped), and the ENCODING_FUZZ_FAULT=encode-cp1252|decode-per-byte fault modes exercise the worker → replay → minimize pipeline end to end (minimal counterexamples: '80' and 'E7 B8').

Also records an upstream finding in the handoff: the #63863 PHPUnit test's invalid-input coverage is vacuous (integer interpolation instead of chr(), single-quoted escape sequences, U+E000 boundary off-by-one).
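The two minimal counterexamples can be reproduced with a small sketch. It is written in Python for illustration and is not the PR's code: all function names are hypothetical, CPython's `errors="replace"` decoding is used as a stand-in for maximal-subpart behavior, and the two "broken" functions only imitate the cp1252-confused encoder and per-byte decoder mutation classes named above.

```python
def encode_latin1_to_utf8(data):
    """Reference encoder: each byte is the Latin-1 code point of that value."""
    return data.decode("latin-1").encode("utf-8")


def encode_cp1252_confused(data):
    """Broken variant: reads bytes 0x80 to 0x9F as Windows-1252 characters."""
    return data.decode("cp1252", errors="replace").encode("utf-8")


def decode_maximal_subpart(data):
    """UTF-8 to Latin-1 text: one '?' per maximal ill-formed subpart,
    and one '?' per valid code point above U+00FF."""
    text = data.decode("utf-8", errors="replace")
    return "".join(ch if ord(ch) <= 0xFF else "?" for ch in text)


def decode_per_byte(data):
    """Broken variant: one '?' for every byte of an ill-formed span."""
    out = []
    i = 0
    while i < len(data):
        for width in (1, 2, 3, 4):
            chunk = data[i:i + width]
            try:
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            if len(chunk) == width and len(ch) == 1:
                out.append(ch if ord(ch) <= 0xFF else "?")
                i += width
                break
        else:
            # No well-formed sequence starts here: replace one byte only.
            out.append("?")
            i += 1
    return "".join(out)


def differs(reference, candidate, data):
    """True when the candidate disagrees with the reference on this input."""
    return reference(data) != candidate(data)
```

With these definitions, `differs(encode_latin1_to_utf8, encode_cp1252_confused, b"\x80")` and `differs(decode_maximal_subpart, decode_per_byte, b"\xe7\xb8")` are both true, while the decoders agree on well-formed input such as `b"\xc3\xa9"`.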
Fuzz wp_has_noncharacters() (PCRE branch) and
_wp_has_noncharacters_fallback() against a trivial mb_str_split/mb_ord
reference oracle, on valid input only. On ill-formed input the public
function's answer depends on which environment branch of utf8.php
loaded — the PCRE branch returns false whenever preg_match fails while
the fallback skips invalid spans and reports noncharacters around them
("\xC0\xEF\xBF\xBE": PCRE false, fallback true). Per the handoff's
option (a), the fuzzer treats behavior as undefined unless
wp_is_valid_utf8() returns true, and pins the divergence with a fixed regression
vector in the smoke test; whether core aligns the implementations or
documents the stance remains an open question for the function author.
The reference oracle's battery covers the boundaries and interior of
the U+FDD0–U+FDEF block and the final two code points of EVERY plane
with their lower neighbors — the PCRE implementation enumerates each
plane as a separate hand-typed escape, so a single-plane typo is the
realistic bug class and now has deterministic coverage. The oracle
throws on ill-formed input rather than silently coercing mb_ord(false).
BOUNDARY_CODE_POINTS gains block-interior, adjacent-negative, and
mid-plane code points (seed re-derivation of older findings is
invalidated; documented in the README — artifact replays are
unaffected).
Mutation variants: blind detector, U+FDD0-block miss, over-eager
detector (shared between the smoke test and the new
ENCODING_FUZZ_FAULT=nonchars-miss-fdd0|nonchars-overeager fault modes,
one per target; both verified through worker, replay, and minimize).
Worker environment metadata now records pcre_u (which utf8.php branch
loaded) and the active fault name so injected artifacts can never be
mistaken for real findings.
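A reference oracle of the kind described above fits in a few lines when written arithmetically. The sketch below is in Python for illustration and is not the PR's mb_str_split/mb_ord implementation: `is_noncharacter`, `has_noncharacters`, and `battery` are hypothetical names, and the battery is only an approximation of the vectors the commit message describes.

```python
def is_noncharacter(cp):
    """True for U+FDD0 to U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def has_noncharacters(data):
    """Reference answer for valid UTF-8 only.

    Ill-formed input raises UnicodeDecodeError instead of being coerced,
    mirroring the stance that behavior there is undefined.
    """
    text = data.decode("utf-8")
    return any(is_noncharacter(ord(ch)) for ch in text)


def battery():
    """Known-answer vectors: block edges and interior, then every plane."""
    vectors = [
        (0xFDCF, False),  # lower neighbor of the block
        (0xFDD0, True),   # first code point of the block
        (0xFDE0, True),   # block interior
        (0xFDEF, True),   # last code point of the block
        (0xFDF0, False),  # upper neighbor of the block
    ]
    for plane in range(17):
        base = plane << 16
        vectors.append((base | 0xFFFD, False))  # lower neighbor
        vectors.append((base | 0xFFFE, True))
        vectors.append((base | 0xFFFF, True))
    return vectors
```

Checking a candidate implementation against `battery()` covers each plane separately, which is the point of the per-plane vectors: an implementation that mistypes one plane's escape fails exactly that plane's entries.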
The domain of code_point_to_utf8_bytes (~1.1M code points) is small enough to test completely instead of fuzzing. The new standalone script checks every code point 0x0-0x10FFFF plus out-of-range probes against the fuzzer's pure-arithmetic encoder (the independent oracle), with an explicit mb_chr( $cp, 'UTF-8' ) consistency cross-check; surrogates and out-of-range values must yield U+FFFD. Runs in ~0.4s. The harness smoke test executes it and proves its detection fires via the script-local ENCODING_FUZZ_FAULT=codepoint-surrogate-qmark variant.

Documents an upstream regression (pinned as a labeled KNOWN ISSUE check so the stance cannot silently go stale): since [62424] (#65342, unreleased) the implementation calls mb_chr() without an explicit encoding, inheriting mb_internal_encoding(), which WordPress sets from blog_charset. As a result, non-UTF-8 sites get raw legacy bytes for mappable code points while invalid ones still yield UTF-8 U+FFFD, contradicting the docblock and mixing encodings with the named character reference path. The 6.6.0 original was pure arithmetic and always emitted UTF-8; the same commit changed code point 0 from U+FFFD to NUL. One-line upstream fix: mb_chr( $code_point, 'UTF-8' ).

Closes out the extend-encoding-fuzzer handoff: all three sections done, definition of done verified and recorded in the handoff doc.
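An exhaustive check of this kind can be sketched as follows. This is an illustrative Python version, not the PR's standalone PHP script: `encode_code_point`, `encode_surrogate_qmark`, and `exhaustive_mismatches` are hypothetical names, CPython's `chr(cp).encode("utf-8")` plays the role of the cross-check, and encoding code point 0 as a NUL byte is an assumption of the sketch.

```python
REPLACEMENT = b"\xef\xbf\xbd"  # U+FFFD encoded as UTF-8


def encode_code_point(cp):
    """Pure-arithmetic UTF-8 encoder; invalid scalar values give U+FFFD."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return REPLACEMENT
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def encode_surrogate_qmark(cp):
    """Deliberately broken variant: surrogates become '?' instead of U+FFFD."""
    if 0xD800 <= cp <= 0xDFFF:
        return b"?"
    return encode_code_point(cp)


def exhaustive_mismatches(encode):
    """Compare `encode` with the expected bytes for every code point,
    plus a few out-of-range probes. Returns the mismatching values."""
    bad = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            expected = REPLACEMENT
        else:
            expected = chr(cp).encode("utf-8")
        if encode(cp) != expected:
            bad.append(cp)
    for cp in (-1, 0x110000, 0x7FFFFFFF):
        if encode(cp) != REPLACEMENT:
            bad.append(cp)
    return bad
```

Running `exhaustive_mismatches(encode_surrogate_qmark)` reports exactly the surrogate range, which is the same way a fault-injected variant demonstrates that the detection actually fires.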
Trac ticket:
Use of AI Tools
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.